LIFT: Learned Invariant Feature Transform
We introduce a novel Deep Network architecture that implements the full
feature point handling pipeline, that is, detection, orientation estimation,
and feature description. While previous works have successfully tackled each
one of these problems individually, we show how to learn to do all three in a
unified manner while preserving end-to-end differentiability. We then
demonstrate that our Deep pipeline outperforms state-of-the-art methods on a
number of benchmark datasets, without the need for retraining.
Comment: Accepted to ECCV 2016 (spotlight)
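As a rough illustration of the pipeline structure described above, the sketch below chains detection, orientation estimation, and description while keeping every step differentiable. The tiny networks, the soft-argmax keypoint selection, and the single-keypoint simplification are assumptions made for brevity, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFeaturePipeline(nn.Module):
    """Detect -> estimate orientation -> describe, in one differentiable pass."""
    def __init__(self, patch=32):
        super().__init__()
        self.patch = patch
        self.detector = nn.Conv2d(1, 1, 5, padding=2)   # image -> score map
        self.orienter = nn.Sequential(nn.Flatten(), nn.Linear(patch * patch, 1))
        self.descriptor = nn.Sequential(nn.Flatten(), nn.Linear(patch * patch, 128))

    def crop(self, image, cx, cy, angle):
        # Differentiable crop (and rotation) around (cx, cy) via an affine grid.
        scale = self.patch / image.shape[-1]
        cos, sin = torch.cos(angle) * scale, torch.sin(angle) * scale
        row1 = torch.cat([cos, -sin, cx.unsqueeze(1)], dim=1)
        row2 = torch.cat([sin, cos, cy.unsqueeze(1)], dim=1)
        theta = torch.stack([row1, row2], dim=1)        # (B, 2, 3) affine transforms
        grid = F.affine_grid(theta, (image.shape[0], 1, self.patch, self.patch),
                             align_corners=False)
        return F.grid_sample(image, grid, align_corners=False)

    def forward(self, image):
        scores = self.detector(image)                                # 1) detection
        w = F.softmax(scores.flatten(1), dim=1).view_as(scores)      # soft-argmax weights
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, image.shape[-2]),
                                torch.linspace(-1, 1, image.shape[-1]), indexing="ij")
        cx, cy = (w * xs).sum(dim=(1, 2, 3)), (w * ys).sum(dim=(1, 2, 3))
        upright = self.crop(image, cx, cy, torch.zeros(image.shape[0], 1))
        angle = self.orienter(upright)                               # 2) orientation
        patch = self.crop(image, cx, cy, angle)                      # canonical patch
        return torch.stack([cx, cy], dim=1), self.descriptor(patch)  # 3) description

xy, desc = ToyFeaturePipeline()(torch.rand(1, 1, 128, 128))
```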
Neural Fourier Filter Bank
We present a novel method to provide efficient and highly detailed
reconstructions. Inspired by wavelets, we learn a neural field that decomposes
the signal both spatially and frequency-wise. We follow the recent grid-based
paradigm for spatial decomposition, but unlike existing work, we encourage
specific frequencies to be stored in each grid via Fourier feature encodings.
We then apply a multi-layer perceptron with sine activations, feeding these
Fourier-encoded features into the appropriate layers so that higher-frequency
components are accumulated sequentially on top of lower-frequency components,
which we then sum to form the final output. We demonstrate that our method
outperforms the state of the art regarding model compactness and convergence
speed on multiple tasks: 2D image fitting, 3D shape reconstruction, and neural
radiance fields. Our code is available at https://github.com/ubc-vision/NFFB
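As a toy illustration of the decomposition described above, the sketch below injects Fourier feature encodings of increasing frequency into successive layers of a sine-activated MLP and sums the per-level outputs. The fixed random Fourier bases, layer sizes, and the absence of the multi-scale grids are simplifying assumptions; the repository linked above contains the actual architecture.

```python
import math
import torch
import torch.nn as nn

class ToyFourierFilterBank(nn.Module):
    def __init__(self, in_dim=2, hidden=64, out_dim=3, num_levels=4):
        super().__init__()
        # One fixed random Fourier basis per level, doubling in frequency.
        self.bases = [2.0 ** i * torch.randn(in_dim, hidden) for i in range(num_levels)]
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_levels))
        self.heads = nn.ModuleList(nn.Linear(hidden, out_dim) for _ in range(num_levels))

    def forward(self, x):
        h, out = 0.0, 0.0
        for basis, layer, head in zip(self.bases, self.layers, self.heads):
            enc = torch.sin(2 * math.pi * x @ basis)   # Fourier features for this level
            h = torch.sin(layer(h + enc))              # sine-activated layer, accumulating
            out = out + head(h)                        # sum per-level contributions
        return out

coords = torch.rand(1024, 2)                           # e.g. 2D pixel coordinates in [0, 1]
rgb = ToyFourierFilterBank()(coords)                   # one predicted value per coordinate
```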
Learning to Find Good Correspondences
We develop a deep architecture to learn to find good correspondences for
wide-baseline stereo. Given a set of putative sparse matches and the camera
intrinsics, we train our network in an end-to-end fashion to label the
correspondences as inliers or outliers, while simultaneously using them to
recover the relative pose, as encoded by the essential matrix. Our architecture
is based on a multi-layer perceptron operating on pixel coordinates rather than
directly on the image, and is thus simple and small. We introduce a novel
normalization technique, called Context Normalization, which allows us to
process each data point separately while imbuing it with global information,
and also makes the network invariant to the order of the correspondences. Our
experiments on multiple challenging datasets demonstrate that our method is
able to drastically improve the state of the art with little training data.
Comment: CVPR 2018 (Oral)
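A minimal sketch of the Context Normalization idea described above, under assumed shapes and layer sizes: each putative correspondence is processed independently by shared weights, but its activations are normalized with statistics computed over all correspondences of the same image pair, which injects global context while keeping the network invariant to their order.

```python
import torch
import torch.nn as nn

def context_norm(x, eps=1e-3):
    # x: (batch, num_correspondences, channels); statistics are taken over the
    # whole set of correspondences, then applied to each point separately.
    mean = x.mean(dim=1, keepdim=True)
    std = x.std(dim=1, keepdim=True)
    return (x - mean) / (std + eps)

class CNBlock(nn.Module):
    """One shared perceptron layer followed by Context Normalization."""
    def __init__(self, channels=128):
        super().__init__()
        self.fc = nn.Linear(channels, channels)        # shared across correspondences
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(context_norm(self.fc(x)))

# Toy usage: 4-D inputs (x1, y1, x2, y2) for 2000 putative matches of one pair.
matches = torch.randn(1, 2000, 4)
net = nn.Sequential(nn.Linear(4, 128), CNBlock(), CNBlock(), nn.Linear(128, 1))
inlier_logits = net(matches)                           # one inlier score per match
```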
LF-Net: Learning Local Features from Images
We present a novel deep architecture and a training strategy to learn a local
feature pipeline from scratch, using collections of images without the need for
human supervision. To do so we exploit depth and relative camera pose cues to
create a virtual target that the network should achieve on one image, provided
the outputs of the network for the other image. While this process is
inherently non-differentiable, we show that we can optimize the network in a
two-branch setup by confining the non-differentiable step to one branch, while preserving
differentiability in the other. We train our method on both indoor and outdoor
datasets, with depth data from 3D sensors for the former, and depth estimates
from an off-the-shelf Structure-from-Motion solution for the latter. Our models
outperform the state of the art on sparse feature matching on both datasets,
while running at 60+ fps for QVGA images.
Comment: NIPS 2018
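The sketch below illustrates, under assumptions, the two-branch trick described above: the network runs on both images, but the second branch is evaluated without gradients and only supplies a virtual target for the first. The warp_j_to_i callable is a hypothetical stand-in for the depth- and pose-based supervision, not the authors' code.

```python
import torch

def training_step(net, image_i, image_j, warp_j_to_i, loss_fn, optimizer):
    score_i = net(image_i)                 # differentiable branch
    with torch.no_grad():
        score_j = net(image_j)             # non-differentiable branch
    target_i = warp_j_to_i(score_j)        # virtual target built from the other image
    loss = loss_fn(score_i, target_i)
    optimizer.zero_grad()
    loss.backward()                        # gradients only flow through branch i
    optimizer.step()
    return loss.item()

# Toy usage with stand-in components (identity "warp", L2 loss, tiny conv net).
net = torch.nn.Conv2d(1, 1, 3, padding=1)
opt = torch.optim.SGD(net.parameters(), lr=0.01)
img_i, img_j = torch.rand(1, 1, 60, 80), torch.rand(1, 1, 60, 80)
training_step(net, img_i, img_j, lambda s: s, torch.nn.functional.mse_loss, opt)
```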
TILDE: A Temporally Invariant Learned DEtector
We introduce a learning-based approach to detect repeatable keypoints under
drastic changes in weather and lighting conditions, to which
state-of-the-art keypoint detectors are surprisingly sensitive. We first
identify good keypoint candidates in multiple training images taken from the
same viewpoint. We then train a regressor to predict a score map whose maxima
are those points so that they can be found by simple non-maximum suppression.
As there are no standard datasets to test the influence of these kinds of
changes, we created our own, which we will make publicly available. We show
that our method significantly outperforms state-of-the-art methods in such
challenging conditions, while still achieving state-of-the-art performance on
the standard Oxford dataset, on which it was not trained.
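As a small illustration of the detection step described above, the sketch below reads keypoints off a predicted score map with simple non-maximum suppression; the learned regressor itself is omitted, and the window size and threshold are assumed values.

```python
import torch
import torch.nn.functional as F

def nms_keypoints(score_map, window=15, threshold=0.5):
    # score_map: (H, W) tensor of per-pixel scores from the learned regressor.
    s = score_map[None, None]                                 # (1, 1, H, W)
    local_max = F.max_pool2d(s, window, stride=1, padding=window // 2)
    keep = (s == local_max) & (s > threshold)                 # thresholded local maxima
    ys, xs = torch.nonzero(keep[0, 0], as_tuple=True)
    return torch.stack([xs, ys], dim=1)                       # (N, 2) keypoint coordinates

keypoints = nms_keypoints(torch.rand(240, 320))               # toy score map
```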
Layered Controllable Video Generation
We introduce layered controllable video generation, where we, without any
supervision, decompose the initial frame of a video into foreground and
background layers, with which the user can control the video generation process
by simply manipulating the foreground mask. The key challenges are the
unsupervised foreground-background separation, which is ambiguous, and the ability
to anticipate user manipulations with access to only raw video sequences. We
address these challenges by proposing a two-stage learning procedure. In the
first stage, using a rich set of losses and a dynamic foreground-size prior, we
learn how to separate the frame into foreground and background layers and,
conditioned on these layers, how to generate the next frame using a VQ-VAE
generator. In the second stage, we fine-tune this network to anticipate edits
to the mask by fitting a (parameterized) control to the mask from the future
frame.
We demonstrate the effectiveness of this learning procedure and of the more granular control
mechanism, while illustrating state-of-the-art performance on two benchmark
datasets. We provide a video abstract as well as some video results on
https://gabriel-huang.github.io/layered_controllable_video_generation
Comment: This paper has been accepted to ECCV 2022 as an Oral paper.
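A minimal sketch, under assumptions, of the layered compositing that the mask-based control relies on: a frame is modelled as a foreground layer, a background layer, and a foreground mask, and the user steers generation by editing the mask. The generator here is a placeholder for the VQ-VAE-based model, not the authors' implementation.

```python
import torch

def composite(foreground, background, mask):
    # mask in [0, 1] with shape (B, 1, H, W); layers are (B, 3, H, W) images.
    return mask * foreground + (1.0 - mask) * background

def generate_next_frame(generator, prev_frame, edited_mask):
    # The (hypothetical) generator predicts the next foreground and background
    # layers conditioned on the previous frame and the user-edited mask.
    fg, bg = generator(prev_frame, edited_mask)
    return composite(fg, bg, edited_mask)

# Stand-in usage: a dummy "generator" that reuses the previous frame as foreground.
prev = torch.rand(1, 3, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()       # a user-editable foreground mask
dummy_gen = lambda frame, m: (frame, torch.zeros_like(frame))
next_frame = generate_next_frame(dummy_gen, prev, mask)
```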